In this fictitious case study, a bike sharing company from Chicago wants to understand how riders with an active membership status differ in their renting behavior compared to casually riding people without a membership subsrciption. Also, some general advice is appreciated.
Data analyzed is the divvy tripdata (available at https://divvy-tripdata.s3.amazonaws.com/index.html). Monthly data from between August 2021 and July 2022 was downloaded and provided via the GCP Cloud Storage and BigQuery Service. The data analyzed comprises almost 6 million rides. All 12 datasets were combined in BigQuery using a SQL union statement. No further data wrangling has taken place before, all data preparation will tke place in this notebook.
In order to make general recommendations and to answer the particular question, all data is first submitted to a sanity check and afterwards analyzed with different statistical methods, like Mann-Whitney-U-test, Kruskal-Wallis-test, and Pearson's Chi Quare test. Relationships are then visualized using different techniques, including geospatial plotting.
#Set up working environment
%pip install pandasql
%pip install researchpy
%pip install folium==0.5.0
import folium
from google.cloud.bigquery import Client, QueryJobConfig
import numpy as np
import researchpy as rp
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import datetime as dt
from pandasql import sqldf
from scipy import stats
from scipy.stats import chi2_contingency
from scipy.stats import mannwhitneyu
import pylab
pysqldf=lambda q:sqldf(q, globals())
Requirement already satisfied: pandasql in /opt/conda/lib/python3.7/site-packages (0.7.3) Requirement already satisfied: pandas in /opt/conda/lib/python3.7/site-packages (from pandasql) (1.3.5) Requirement already satisfied: sqlalchemy in /opt/conda/lib/python3.7/site-packages (from pandasql) (1.4.42) Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from pandasql) (1.21.6) Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas->pandasql) (2.8.2) Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas->pandasql) (2022.5) Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from sqlalchemy->pandasql) (4.11.4) Requirement already satisfied: greenlet!=0.4.17 in /opt/conda/lib/python3.7/site-packages (from sqlalchemy->pandasql) (1.1.3) Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas->pandasql) (1.16.0) Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->sqlalchemy->pandasql) (4.4.0) Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->sqlalchemy->pandasql) (3.10.0) Note: you may need to restart the kernel to use updated packages. Requirement already satisfied: researchpy in /opt/conda/lib/python3.7/site-packages (0.3.5) Requirement already satisfied: pandas in /opt/conda/lib/python3.7/site-packages (from researchpy) (1.3.5) Requirement already satisfied: patsy in /opt/conda/lib/python3.7/site-packages (from researchpy) (0.5.3) Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from researchpy) (1.21.6) Requirement already satisfied: scipy in /opt/conda/lib/python3.7/site-packages (from researchpy) (1.7.3) Requirement already satisfied: statsmodels in /opt/conda/lib/python3.7/site-packages (from researchpy) (0.13.2) Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas->researchpy) (2.8.2) Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas->researchpy) (2022.5) Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from patsy->researchpy) (1.16.0) Requirement already satisfied: packaging>=21.3 in /opt/conda/lib/python3.7/site-packages (from statsmodels->researchpy) (21.3) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging>=21.3->statsmodels->researchpy) (3.0.9) Note: you may need to restart the kernel to use updated packages. Requirement already satisfied: folium==0.5.0 in /opt/conda/lib/python3.7/site-packages (0.5.0) Requirement already satisfied: requests in /opt/conda/lib/python3.7/site-packages (from folium==0.5.0) (2.28.1) Requirement already satisfied: branca in /opt/conda/lib/python3.7/site-packages (from folium==0.5.0) (0.6.0) Requirement already satisfied: jinja2 in /opt/conda/lib/python3.7/site-packages (from folium==0.5.0) (3.1.2) Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from folium==0.5.0) (1.16.0) Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.7/site-packages (from jinja2->folium==0.5.0) (2.1.1) Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests->folium==0.5.0) (3.4) Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests->folium==0.5.0) (1.26.11) Requirement already satisfied: charset-normalizer<3,>=2 in /opt/conda/lib/python3.7/site-packages (from requests->folium==0.5.0) (2.1.1) Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests->folium==0.5.0) (2022.9.24) Note: you may need to restart the kernel to use updated packages.
#Importing data from BigQuery and providing a brief summary
#Note: This Notebook was originally hosted on Google Cloud Vertex AI Workbench
client = Client()
query = """SELECT * FROM `flowing-access-365407.data2020.complete21-22` """
job = client.query(query)
df = job.to_dataframe()
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5901463 entries, 0 to 5901462 Data columns (total 13 columns): # Column Dtype --- ------ ----- 0 ride_id object 1 rideable_type object 2 started_at datetime64[ns, UTC] 3 ended_at datetime64[ns, UTC] 4 start_station_name object 5 start_station_id object 6 end_station_name object 7 end_station_id object 8 start_lat float64 9 start_lng float64 10 end_lat float64 11 end_lng float64 12 member_casual object dtypes: datetime64[ns, UTC](2), float64(4), object(7) memory usage: 585.3+ MB
| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E5A1BFF51EA756C2 | electric_bike | 2022-07-03 15:44:07+00:00 | 2022-07-03 15:57:38+00:00 | None | None | None | None | 41.98 | -87.65 | 42.0 | -87.66 | casual |
| 1 | 1E373F1BD866271C | electric_bike | 2021-10-10 18:59:43+00:00 | 2021-10-10 19:04:42+00:00 | None | None | None | None | 41.99 | -87.68 | 42.0 | -87.68 | casual |
| 2 | A54D606AC1CEA7A0 | electric_bike | 2021-08-13 17:10:07+00:00 | 2021-08-13 17:15:21+00:00 | None | None | None | None | 41.99 | -87.69 | 42.0 | -87.69 | member |
| 3 | 86FF6A9414412563 | electric_bike | 2022-01-05 12:18:27+00:00 | 2022-01-05 13:05:14+00:00 | None | None | None | None | 42.00 | -87.68 | 42.0 | -87.68 | casual |
| 4 | F7A435013644F529 | electric_bike | 2022-05-09 19:23:38+00:00 | 2022-05-09 19:31:52+00:00 | None | None | None | None | 42.02 | -87.71 | 42.0 | -87.71 | casual |
#Next, some time wrangling is applied.
#First, the ride length is computed.
df['ride_length'] = df['ended_at'] - df['started_at']
df['ride_length_min']=df.ride_length.dt.total_seconds()/60 #ride lengths in minutes
#Then, variables containing the month, weekday, and time of day in hour of each rent are computed.
df['hour'] = df['started_at'].dt.hour
df['weekday'] = df['started_at'].dt.dayofweek #0 means monday
df['month'] = df['started_at'].dt.month
df.isnull().sum()
ride_id 0 rideable_type 0 started_at 0 ended_at 0 start_station_name 860786 start_station_id 860784 end_station_name 919896 end_station_id 919896 start_lat 0 start_lng 0 end_lat 5590 end_lng 5590 member_casual 0 ride_length 0 ride_length_min 0 hour 0 weekday 0 month 0 dtype: int64
Besides start and end station names and id, end latitude and longitude, there are no missings in the data frame. This is not a problem at all, as all data will be analyzed per start of rents considering the geographic coordinates. Also, as data will be plotted in a map, the names of stations will not be of interest. The same applies for station IDs.
#Membership status
display(df['member_casual'].value_counts())
df['member_casual'].value_counts().plot(kind='bar', color='indianred')
member 3379237 casual 2522226 Name: member_casual, dtype: int64
<AxesSubplot:>
Apparently, there are more member than casual riders. But the database is rich enough to perform analyses of differences between both groups.
#Bike type
display(df['rideable_type'].value_counts())
df['rideable_type'].value_counts().plot(kind='bar', color='indianred')
classic_bike 3055641 electric_bike 2619094 docked_bike 226728 Name: rideable_type, dtype: int64
<AxesSubplot:>
Assuming that the company offers more classic bikes than electric or docked bikes for rent, the distribution is not surprising. And all data of docked bikes remains in the dataset.
df_m=df[['rideable_type', 'ride_length_min']].groupby(['rideable_type']).mean()
df_s=df[['rideable_type', 'ride_length_min']].groupby(['rideable_type']).std()
# Create lists for the plot
bikes = ['Classic', 'Docked', 'Electric']
x_pos = np.arange(len(bikes))
Ms = [df_m.iat[0,0], df_m.iat[1,0], df_m.iat[2,0]]
error = [df_s.iat[0,0], df_s.iat[1,0], df_s.iat[2,0]]
# Build the plot
fig, ax = plt.subplots()
ax.bar(x_pos, Ms, yerr=error, align='center', ecolor='black', capsize=5, color=['grey', 'indianred', 'grey'])
ax.set_xticks(x_pos)
ax.set_xticklabels(bikes)
ax.set_ylim([0,120])
plt.title('Average ride length in minutes and standard deviation for different bike types')
plt.tight_layout()
plt.show()
On average, rented dock bikes are ridden the longest.
#Create a map of centered around the US
rent_map = folium.Map(location=[39.74, -104.99], zoom_start=4, tiles='CartoDB positron')
#As the dataset is too large for folium to process, all latitudes and longitudes are rounded with 3 decimal places
#and then grouped to shorten the df. The datapoints are then plotted and weighted by their occurency to account for
#geospatial accumulations in different locations.
df['round_start_lat']=round(df['start_lat'],3)
df['round_start_lng']=round(df['start_lng'],3)
df_map=df[['round_start_lat', 'round_start_lng','ride_length_min']].groupby(by=['round_start_lat', 'round_start_lng']).count()
df_map.rename(columns={'ride_length_min':'frequency'}, inplace=True)
df_map.reset_index(inplace=True)
#Initiate a feature group for the rents in the dataframe
rents = folium.map.FeatureGroup()
#Loop through the rents and add each to the rents feature group
for lat, lng, size in zip(df_map.round_start_lat, df_map.round_start_lng, df_map.frequency):
rents.add_child(
folium.features.CircleMarker(
[lat, lng],
radius=size/4000,
color='indianred',
fill=True,
fill_color='indianred',
fill_opacity=0.7
)
)
#Add rents to map
rent_map.add_child(rents)
The map mainly shows rents in Chicago, and a few in Montreal. As climate, rent policies, marketing, and customer base, etc. might be different in Canada, this analysis will focus on the Chicago region. The following boxplots show the distribution of start latitudes and longitudes before and after the removal of cases outside of the Chicago region.
df.boxplot(column="start_lat")
plt.title('Before outlier removal')
plt.show()
df.drop(df[df.start_lat > 42.5].index, inplace=True)
df.boxplot(column="start_lat")
plt.title('After outlier removal')
Text(0.5, 1.0, 'After outlier removal')
df.boxplot(column="start_lng")
plt.title('Before outlier removal')
plt.show()
df.drop(df[df.start_lng > -86].index, inplace=True)
df.boxplot(column="start_lng")
plt.title('After outlier removal')
Text(0.5, 1.0, 'After outlier removal')
#Ride length
print("All rides were between", df.ride_length_min.min(), "minutes and",df.ride_length_min.max()/60/24, "days long.")
All rides were between -137.41666666666666 minutes and 28.90914351851852 days long.
This variable clearly holds some errors. Negative values will be replaced by mean imputation. However, long-term rentals are not necessarily a data error, so they are left in the data set. Nevertheless, to control for a possible bias, a quartile split is applied, dividing the data set into long-term and short-term rentals, so that more differentiated findings are possible. The cut-offs are set to the 25th and 75th quartile.
m = df.ride_length_min.mean()
df['ride_length_min'].mask(df['ride_length_min'] < 0, m, inplace=True)
quart = df['ride_length_min'].quantile([.25, .5, .75])
display(quart)
df['short_rides']=df['ride_length_min']
df['short_rides'].mask(df['short_rides'] > quart[0.25], inplace=True)
df['medium_rides']=df['ride_length_min']
df['medium_rides'].mask(df['medium_rides'] > quart[0.75], inplace=True)
df['medium_rides'].mask(df['medium_rides'] < quart[0.25], inplace=True)
df['long_rides']=df['ride_length_min']
df['long_rides'].mask(df['long_rides'] < quart[0.75], inplace=True)
0.25 6.166667 0.50 10.950000 0.75 19.816667 Name: ride_length_min, dtype: float64
50% of the ride lengths are 6 to 20 minutes long. Thus, this time span will be used to generate insights into the usual rents. Rents shorter than 6 minutes are considered ultra short term rents, rents longer than 20 minutes will be considered long term rents.
fig = plt.figure()
ax0 = fig.add_subplot(131)
df.boxplot(column="short_rides", ax=ax0)
ax1 = fig.add_subplot(132)
df.boxplot(column="medium_rides", ax=ax1)
ax2 = fig.add_subplot(133)
df.boxplot(column="long_rides", ax=ax2)
fig.tight_layout()
plt.show()
The boxplot show no outliers for short and medium rides. Long rental periods show many extreme values. Nevertheless, these are left in the data because, as stated above, the long rental periods are quite plausible and could even be in the company's interest. Lastly, the three subcategories and overall ride length are checked for normal distribution to determine further statistical tests.
stats.probplot(df['ride_length_min'][df['member_casual'] == 'casual'], dist="norm", plot=pylab)
plt.title('PP Plot for overall ride length of casual riders')
pylab.show()
stats.probplot(df['ride_length_min'][df['member_casual'] == 'member'], dist="norm", plot=pylab)
plt.title('PP Plot for overall ride length of member riders')
pylab.show()
stats.probplot(df['short_rides'][df['member_casual'] == 'casual'], dist="norm", plot=pylab)
plt.title('PP Plot for short rides of casual riders')
pylab.show()
stats.probplot(df['short_rides'][df['member_casual'] == 'member'], dist="norm", plot=pylab)
plt.title('PP Plot for short rides of member riders')
pylab.show()
stats.probplot(df['medium_rides'][df['member_casual'] == 'casual'], dist="norm", plot=pylab)
plt.title('PP Plot for medium rides of casual riders')
pylab.show()
stats.probplot(df['medium_rides'][df['member_casual'] == 'member'], dist="norm", plot=pylab)
plt.title('PP Plot for medium rides of member riders')
pylab.show()
stats.probplot(df['long_rides'][df['member_casual'] == 'casual'], dist="norm", plot=pylab)
plt.title('PP Plot for long rides of casual riders')
pylab.show()
stats.probplot(df['long_rides'][df['member_casual'] == 'member'], dist="norm", plot=pylab)
plt.title('PP Plot for long rides of member riders')
pylab.show()
All four variables are not normally distributed, neither in the member nor in the casual group. Therefore, nonparametric tests are performed in the analysis section.
#Month
print("All timestamps lie between", df.started_at.min(), "and",df.started_at.max())
df.boxplot(column="month")
plt.title('Months - Before outlier removal')
plt.show()
df.boxplot(column="weekday")
plt.title('Weekdays - Before outlier removal')
plt.show()
df.boxplot(column="hour")
plt.title('Time of day - Before outlier removal')
plt.show()
All timestamps lie between 2021-08-01 00:00:04+00:00 and 2022-07-31 23:59:58+00:00
All month, weekday and hour codes seem to be computed correctly (see correlation matrix under Exploratory Analysis for further confirmation) and lie within the correct time span.
#Ride length
df['ride_length_min'].hist(color='indianred')
<AxesSubplot:>
Most rides were considerably short. This fits the quartile analysis above.
df_vg = df[['month', 'ride_id']].groupby(by='month').count().reset_index()
df_vg.plot(x='month', kind='bar', color='indianred', legend=False)
<AxesSubplot:xlabel='month'>
This very quick first impression shows that customers rode bikes fairly more often during sommer months.
df_vg = df[['weekday', 'ride_id']].groupby(by='weekday').count().reset_index()
df_vg.plot(x='weekday', kind='bar', color='indianred', legend=False)
<AxesSubplot:xlabel='weekday'>
There are slitghtly more rides on weekends, most rides are on Saturdays (0 means Monday).
df_vg = df[['hour', 'ride_id']].groupby(by='hour').count().reset_index()
df_vg.plot(x='hour', kind='bar', color='indianred', legend=False)
<AxesSubplot:xlabel='hour'>
As could be expected, rentals are peaking at rush hour time (4 to 6pm). Consequently, bike maintanance should take place in the morning.
#The long rides variable will distort the shape of the plot.
#Therefore, all rents longer than an hour will be set to 60 minutes for this visualization
warnings.filterwarnings("ignore")
df1=df[['started_at', 'long_rides']]
df1['long_rides'].mask(df1['long_rides'] >60, 60, inplace=True)
#Building scatter plot
fig, ax = plt.subplots()
fig.set_size_inches(5,10)
ax.scatter(df['started_at'], df['short_rides'], s=0.01, c='slategrey', alpha=0.03, vmin=0, vmax=60)
ax.scatter(df['started_at'], df['medium_rides'], s=0.01, c='skyblue', alpha=0.03, vmin=0, vmax=60)
ax.scatter(df1['started_at'], df1['long_rides'], s=0.01, c='cadetblue', alpha=0.03, vmin=0, vmax=60)
ax.legend()
plt.show()
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
The plot above shows that in winter months rides are shorter and fewer rides are taken. The darker the color, the more bikes were rented. Rents longer than 60 minutes are depicted by the dark red line on top.
The following table displays descreptive statistics of all variables in the dataframe, including continuous and categorical variables.
df.describe(include='all', datetime_is_numeric=True).round(decimals=2)
| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | ... | ride_length | ride_length_min | hour | weekday | month | round_start_lat | round_start_lng | short_rides | medium_rides | long_rides | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5901462 | 5901462 | 5901462 | 5901462 | 5040676 | 5040678 | 4981566 | 4981566 | 5901462.00 | 5901462.00 | ... | 5901462 | 5901462.00 | 5901462.00 | 5901462.00 | 5901462.00 | 5901462.00 | 5901462.00 | 1477516.00 | 2955827.00 | 1475917.00 |
| unique | 5901462 | 3 | NaN | NaN | 1380 | 1225 | 1395 | 1235 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | E5A1BFF51EA756C2 | classic_bike | NaN | NaN | Streeter Dr & Grand Ave | 13022 | Streeter Dr & Grand Ave | 13022 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | 3055641 | NaN | NaN | 80414 | 80414 | 80767 | 80767 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | 2022-01-31 21:50:42.494991872+00:00 | 2022-01-31 22:10:35.871607296+00:00 | NaN | NaN | NaN | NaN | 41.90 | -87.65 | ... | 0 days 00:19:53.376616675 | 19.89 | 14.21 | 3.13 | 7.24 | 41.90 | -87.65 | 3.86 | 11.58 | 52.52 |
| min | NaN | NaN | 2021-08-01 00:00:04+00:00 | 2021-08-01 00:03:11+00:00 | NaN | NaN | NaN | NaN | 41.64 | -87.84 | ... | -1 days +21:42:35 | 0.00 | 0.00 | 0.00 | 1.00 | 41.64 | -87.84 | 0.00 | 6.17 | 19.82 |
| 25% | NaN | NaN | 2021-09-27 12:35:11.249999872+00:00 | 2021-09-27 12:54:00.750000128+00:00 | NaN | NaN | NaN | NaN | 41.88 | -87.66 | ... | 0 days 00:06:10 | 6.17 | 11.00 | 1.00 | 5.00 | 41.88 | -87.66 | 2.87 | 8.35 | 23.97 |
| 50% | NaN | NaN | 2022-02-14 14:10:11.500000+00:00 | 2022-02-14 14:20:26+00:00 | NaN | NaN | NaN | NaN | 41.90 | -87.64 | ... | 0 days 00:10:57 | 10.95 | 15.00 | 3.00 | 7.00 | 41.90 | -87.64 | 4.10 | 10.95 | 30.28 |
| 75% | NaN | NaN | 2022-06-05 15:29:40.750000128+00:00 | 2022-06-05 15:54:48.500000+00:00 | NaN | NaN | NaN | NaN | 41.93 | -87.63 | ... | 0 days 00:19:49 | 19.82 | 18.00 | 5.00 | 9.00 | 41.93 | -87.63 | 5.15 | 14.43 | 43.35 |
| max | NaN | NaN | 2022-07-31 23:59:58+00:00 | 2022-08-04 13:53:01+00:00 | NaN | NaN | NaN | NaN | 42.07 | -87.52 | ... | 28 days 21:49:10 | 41629.17 | 23.00 | 6.00 | 12.00 | 42.07 | -87.52 | 6.17 | 19.82 | 41629.17 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.05 | 0.03 | ... | 0 days 02:27:59.110178684 | 147.98 | 5.04 | 2.00 | 2.57 | 0.05 | 0.03 | 1.58 | 3.76 | 293.39 |
11 rows × 23 columns
The next table in the section below depicts all variable correlations. Correlations range between -1 (strong negative relationship) and 1 (strong positive relationship), with 0 indicating no relationship at all. To visualize the correlations from the table, the heatmap below also gives correlation coefficients of all variables but pigments the cells dependent on the size and direction of the correlation coefficent. Spearman correlation is performed as not all variables are normally distributed (see sections below).
corr= df.corr(method='spearman').round(decimals=2)
corr
| start_lat | start_lng | end_lat | end_lng | ride_length_min | hour | weekday | month | round_start_lat | round_start_lng | short_rides | medium_rides | long_rides | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| start_lat | 1.00 | -0.57 | 0.83 | -0.50 | 0.03 | 0.03 | 0.06 | -0.01 | 1.00 | -0.57 | 0.02 | 0.03 | -0.09 |
| start_lng | -0.57 | 1.00 | -0.49 | 0.73 | 0.05 | -0.01 | -0.00 | 0.00 | -0.57 | 1.00 | -0.02 | -0.01 | 0.16 |
| end_lat | 0.83 | -0.49 | 1.00 | -0.57 | 0.04 | 0.07 | 0.05 | -0.01 | 0.83 | -0.49 | 0.02 | 0.04 | -0.08 |
| end_lng | -0.50 | 0.73 | -0.57 | 1.00 | 0.05 | -0.06 | -0.00 | 0.00 | -0.50 | 0.73 | -0.02 | -0.01 | 0.17 |
| ride_length_min | 0.03 | 0.05 | 0.04 | 0.05 | 1.00 | 0.04 | 0.11 | -0.00 | 0.03 | 0.05 | 1.00 | 1.00 | 1.00 |
| hour | 0.03 | -0.01 | 0.07 | -0.06 | 0.04 | 1.00 | -0.04 | -0.02 | 0.03 | -0.01 | 0.01 | 0.03 | -0.03 |
| weekday | 0.06 | -0.00 | 0.05 | -0.00 | 0.11 | -0.04 | 1.00 | 0.02 | 0.06 | -0.00 | -0.01 | 0.05 | 0.06 |
| month | -0.01 | 0.00 | -0.01 | 0.00 | -0.00 | -0.02 | 0.02 | 1.00 | -0.01 | 0.00 | 0.02 | -0.00 | -0.01 |
| round_start_lat | 1.00 | -0.57 | 0.83 | -0.50 | 0.03 | 0.03 | 0.06 | -0.01 | 1.00 | -0.57 | 0.02 | 0.03 | -0.09 |
| round_start_lng | -0.57 | 1.00 | -0.49 | 0.73 | 0.05 | -0.01 | -0.00 | 0.00 | -0.57 | 1.00 | -0.02 | -0.01 | 0.16 |
| short_rides | 0.02 | -0.02 | 0.02 | -0.02 | 1.00 | 0.01 | -0.01 | 0.02 | 0.02 | -0.02 | 1.00 | NaN | NaN |
| medium_rides | 0.03 | -0.01 | 0.04 | -0.01 | 1.00 | 0.03 | 0.05 | -0.00 | 0.03 | -0.01 | NaN | 1.00 | NaN |
| long_rides | -0.09 | 0.16 | -0.08 | 0.17 | 1.00 | -0.03 | 0.06 | -0.01 | -0.09 | 0.16 | NaN | NaN | 1.00 |
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
ax = sns.heatmap(corr, annot=True, cmap="RdYlGn", vmin=-1, vmax=1)
plt.show()
The heatmap and correlation table do not reveal any strong relationships between variables that wouldn't have been expected - start and end latitudes and longitudes are naturally correlated as people don't go as far by bike. All other variables do not seem to systematically correlate. Now, let's dive into the project question and examine different behaviors of casual riders and members.
df[['ride_length_min', 'short_rides', 'medium_rides', 'long_rides']].groupby(df['member_casual']).mean().round(2)
| ride_length_min | short_rides | medium_rides | long_rides | |
|---|---|---|---|---|
| member_casual | ||||
| casual | 29.21 | 3.88 | 12.02 | 63.39 |
| member | 12.93 | 3.86 | 11.27 | 35.23 |
Looking at the numbers, members mainly seem to take less long rides than non-members. Let's back it up by statistics and visualization for all four subcategories - global ride length, short rides shorter than 6 minutes, medium rides, and long rides longer than 20 minutes.
df[['ride_length_min']].groupby(df['member_casual']).mean().reset_index()
| member_casual | ride_length_min | |
|---|---|---|
| 0 | casual | 29.212621 |
| 1 | member | 12.932889 |
#Overall ride length
fig = plt.figure()
fig.set_size_inches(20, 5)
ax0 = fig.add_subplot(131)
df1=df[['ride_length_min']].groupby(df['member_casual']).mean().reset_index()
ax0.bar(df1['member_casual'], df1['ride_length_min'], color=['indianred', 'grey'])
plt.title('Average length of rents by members and casual riders in minutes')
ax1=fig.add_subplot(132)
df[['ride_length_min']][df['member_casual'] == 'casual'].hist(ax=ax1, color=['indianred'])
plt.title('Rent minutes of casual riders')
ax2=fig.add_subplot(133)
df[['ride_length_min']][df['member_casual'] == 'member'].hist(ax=ax2, color=['grey'])
plt.title('Rent minutes of members')
plt.show()
#Mann-Whitney-U-test
mannwhitneyu(df['ride_length_min'][df['member_casual'] == 'member'], df['ride_length_min'][df['member_casual'] == 'casual'])
MannwhitneyuResult(statistic=2923851693636.5, pvalue=0.0)
The plots show that on average, rents by casual riders are more than double as long as rents by members on average. As the ride length is not normally distributed in both groups (see sanity checks above; Data is not linear for the theoretical quantiles), a Mann-Whitney-U-Test is performed next. This is the equivalent for an unpaired t-test for nonparametric data. Considering the overall duration of rents, rents by non-members are indeed significantly longer. This will be further examined for the categories of short, medium and long-term rents.
#Short rides
fig = plt.figure()
fig.set_size_inches(20, 5)
ax0 = fig.add_subplot(131)
df1 = df[['short_rides']].groupby(df['member_casual']).mean().reset_index()
ax0.bar(df1['member_casual'], df1['short_rides'], color=['indianred', 'grey'])
plt.title('Average length of short term rents by members and casual riders in minutes')
ax1=fig.add_subplot(132)
df[['short_rides']][df['member_casual'] == 'casual'].hist(ax=ax1, color=['indianred'])
plt.title('Short term rent minutes of casual riders')
ax2=fig.add_subplot(133)
df[['short_rides']][df['member_casual'] == 'member'].hist(ax=ax2, color=['grey'])
plt.title('Short term rent minutes of members')
plt.show()
#Mann-Whitney-U-test
df1=df.loc[df['short_rides'] >= 0]
statistic, p = mannwhitneyu(df1['short_rides'][df1['member_casual'] == 'member'], df1['short_rides'][df1['member_casual'] == 'casual'])
print("Test statistic:", round(statistic,1),", p value:", round(p, 2))
Test statistic: 205055185458.5 , p value: 0.0
For short rides, the mean length of ride minutes does not really differ between casual and member riders. However, both groups differ significantly. As the histograms reveal, members take significantly shorter rides than casual riders in the short term rent domain.
#Medium rides
fig = plt.figure()
fig.set_size_inches(20, 5)
ax0 = fig.add_subplot(131)
df1 = df[['medium_rides']].groupby(df['member_casual']).mean().reset_index()
ax0.bar(df1['member_casual'], df1['medium_rides'], color=['indianred', 'grey'])
plt.title('Average length of medium term rents by members and casual riders in minutes')
ax1=fig.add_subplot(132)
df[['medium_rides']][df['member_casual'] == 'casual'].hist(ax=ax1, color=['indianred'])
plt.title('Medium term rent minutes of casual riders')
ax2=fig.add_subplot(133)
df[['medium_rides']][df['member_casual'] == 'member'].hist(ax=ax2, color=['grey'])
plt.title('Medium term rent minutes of members')
plt.show()
#Mann-Whitney-U-test
df1=df.loc[df['medium_rides'] >= 0]
statistic, p = mannwhitneyu(df1['medium_rides'][df1['member_casual'] == 'member'], df1['medium_rides'][df1['member_casual'] == 'casual'])
print("Test statistic:", round(statistic,1),", p value:", round(p, 2))
Test statistic: 935019027555.5 , p value: 0.0
In this category, too, members take significantly shorter rides than non-members.
#Long rides
fig = plt.figure()
fig.set_size_inches(20, 5)
ax0 = fig.add_subplot(131)
df1 = df[['long_rides']].groupby(df['member_casual']).mean().reset_index()
ax0.bar(df1['member_casual'], df1['long_rides'], color=['indianred', 'grey'])
plt.title('Average length of long term rents by members and casual riders in minutes')
ax1=fig.add_subplot(132)
df[['long_rides']][df['member_casual'] == 'casual'].hist(ax=ax1, color=['indianred'])
plt.title('Long term rent minutes of casual riders')
ax2=fig.add_subplot(133)
df[['long_rides']][df['member_casual'] == 'member'].hist(ax=ax2, color=['grey'])
plt.title('Long term rent minutes of members')
plt.show()
#Mann-Whitney-U-test
df1=df.loc[df['long_rides'] >= 0]
statistic, p = mannwhitneyu(df1['long_rides'][df1['member_casual'] == 'member'], df1['long_rides'][df1['member_casual'] == 'casual'])
print("Test statistic:", round(statistic,1),", p value:", round(p, 2))
Test statistic: 189280646851.5 , p value: 0.0
In this category, it is most noticeable that members rent shorter on average. Knowing that rents by members are shorter, the amount of rides in each category will now be examined using a Person's Chi Square test to not only check the distribution of rental minutes, but also the amount of rents. All assumptions of this test are met.
#computing contingency table
cont_cat = df[['short_rides', 'medium_rides', 'long_rides']].groupby(df['member_casual']).count()
display(cont_cat)
#computing test statistic
stat, p, dof, expected = chi2_contingency(cont_cat)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
print('Dependent (reject H0) - the variables have a significant relation')
else:
print('Independent (H0 holds true)')
| short_rides | medium_rides | long_rides | |
|---|---|---|---|
| member_casual | |||
| casual | 393265 | 1225376 | 906379 |
| member | 1084251 | 1730451 | 569538 |
p value is 0.0 Dependent (reject H0) - the variables have a significant relation
The Chi Square test reveals that the ride-length categories are indeed differently distributed among members and casual riders. The following visualization facilitates a better understanding:
df[['short_rides', 'medium_rides', 'long_rides', 'ride_length_min']].groupby(df['member_casual']).count().plot(kind='bar',
color=['slategrey', 'skyblue', 'cadetblue', 'indianred'])
plt.legend(loc="upper left", labels=['short','medium','long','total'])
plt.title('Amount of rents per category')
Text(0.5, 1.0, 'Amount of rents per category')
This bar chart confirms what we already knew; Members have particularly few long-term rentals and particularly many short-term rentals compared to the number of total rentals
fig = plt.figure()
fig.set_size_inches(10,5)
ax0 = fig.add_subplot(121)
df1=df[['hour','ride_length_min']][df['member_casual']=='casual'].groupby(['hour']).count().reset_index()
ax0.bar(df1['hour'], df1['ride_length_min'], color=['grey','grey','grey','grey','grey','grey','grey','grey','grey','grey','grey','grey','grey','grey','grey','indianred','indianred','indianred','indianred',
'grey','grey','grey','grey','grey'])
plt.title('Amount of casual rides per time of day')
ax1= fig.add_subplot(122)
df1=df[['hour','ride_length_min']][df['member_casual']=='member'].groupby(['hour']).count().reset_index()
ax1.bar(df1['hour'], df1['ride_length_min'], color=['grey','grey','grey','grey','grey','grey','grey','indianred','indianred','grey','grey','grey','grey','grey','grey','grey','indianred','indianred','indianred','grey',
'grey','grey','grey','grey'])
plt.title('Amount of member rides per time of day')
fig.tight_layout()
plt.show()
#unpaired t-test
summary, results = rp.ttest(group1= df['hour'][df['member_casual'] == 'member'], group1_name= "Member",
group2= df['hour'][df['member_casual'] == 'casual'], group2_name= "Casual")
display(results.iloc[0:4])
| Independent t-test | results | |
|---|---|---|
| 0 | Difference (Member - Casual) = | -5.467000e-01 |
| 1 | Degrees of freedom = | 5.901460e+06 |
| 2 | t = | -1.306325e+02 |
| 3 | Two side test p value = | 0.000000e+00 |
Again, the distribution of member and casual riders differs significantly. As the variables are approximately normally distributed in both groups, an unpaired t-test was performed. Member rides show a small peak at the morning rush hour (7 to 9 am) and a strong peak in the afternoon (4 to 7pm), while casual rides steadily increase across the day and peak between 3 and 7 pm. The plot above shows amount of rents. Average length of rents will be examined next.
#Average ride length for members and casual riders by time of day
fig=plt.figure()
fig.set_size_inches(10,5)
ax0=fig.add_subplot(121)
df[['member_casual', 'hour', 'short_rides', 'medium_rides', 'long_rides', 'ride_length_min']][df['member_casual']=='casual'].groupby(['hour']).mean().plot(
kind='line', color=['slategrey', 'skyblue', 'cadetblue', 'indianred'], ax=ax0)
plt.title('Ride length for casual riders by time of day')
ax0.get_legend().remove()
plt.ylim([0,70])
ax1=fig.add_subplot(122)
df[['member_casual', 'hour', 'short_rides', 'medium_rides', 'long_rides', 'ride_length_min']][df['member_casual']=='member'].groupby(['hour']).mean().plot(
kind='line', color=['slategrey', 'skyblue', 'cadetblue', 'indianred'], ax=ax1)
plt.legend(labels=['short','medium','long','total'], loc='upper right')
plt.title('Ride length for member riders by time of day')
plt.ylim([0,70])
fig.tight_layout()
plt.show()
#Kruskal-Wallis-test casual riders
df1=df[['hour', 'ride_length_min']][df['member_casual']=='casual'].pivot(columns='hour', values='ride_length_min')
df1[0].fillna(df1[0].mean(), inplace=True)
df1[1].fillna(df1[1].mean(), inplace=True)
df1[2].fillna(df1[2].mean(), inplace=True)
df1[3].fillna(df1[3].mean(), inplace=True)
df1[4].fillna(df1[4].mean(), inplace=True)
df1[5].fillna(df1[5].mean(), inplace=True)
df1[6].fillna(df1[6].mean(), inplace=True)
df1[7].fillna(df1[7].mean(), inplace=True)
df1[8].fillna(df1[8].mean(), inplace=True)
df1[9].fillna(df1[9].mean(), inplace=True)
df1[10].fillna(df1[10].mean(), inplace=True)
df1[11].fillna(df1[11].mean(), inplace=True)
df1[12].fillna(df1[12].mean(), inplace=True)
df1[13].fillna(df1[13].mean(), inplace=True)
df1[14].fillna(df1[14].mean(), inplace=True)
df1[15].fillna(df1[15].mean(), inplace=True)
df1[16].fillna(df1[16].mean(), inplace=True)
df1[17].fillna(df1[17].mean(), inplace=True)
df1[18].fillna(df1[18].mean(), inplace=True)
df1[19].fillna(df1[19].mean(), inplace=True)
df1[20].fillna(df1[20].mean(), inplace=True)
df1[21].fillna(df1[21].mean(), inplace=True)
df1[22].fillna(df1[22].mean(), inplace=True)
df1[23].fillna(df1[23].mean(), inplace=True)
stats.kruskal(df1[0], df1[1], df1[2], df1[3], df1[4], df1[5], df1[6], df1[7], df1[8], df1[9], df1[10], df1[11], df1[12],
df1[13], df1[14], df1[15], df1[16], df1[17], df1[18], df1[19], df1[20], df1[21], df1[22], df1[23])
KruskalResult(statistic=53495134.53637219, pvalue=0.0)
#Kruskal-Wallis-test members
df1=df[['hour', 'ride_length_min']][df['member_casual']=='member'].pivot(columns='hour', values='ride_length_min')
df1[0].fillna(df1[0].mean(), inplace=True)
df1[1].fillna(df1[1].mean(), inplace=True)
df1[2].fillna(df1[2].mean(), inplace=True)
df1[3].fillna(df1[3].mean(), inplace=True)
df1[4].fillna(df1[4].mean(), inplace=True)
df1[5].fillna(df1[5].mean(), inplace=True)
df1[6].fillna(df1[6].mean(), inplace=True)
df1[7].fillna(df1[7].mean(), inplace=True)
df1[8].fillna(df1[8].mean(), inplace=True)
df1[9].fillna(df1[9].mean(), inplace=True)
df1[10].fillna(df1[10].mean(), inplace=True)
df1[11].fillna(df1[11].mean(), inplace=True)
df1[12].fillna(df1[12].mean(), inplace=True)
df1[13].fillna(df1[13].mean(), inplace=True)
df1[14].fillna(df1[14].mean(), inplace=True)
df1[15].fillna(df1[15].mean(), inplace=True)
df1[16].fillna(df1[16].mean(), inplace=True)
df1[17].fillna(df1[17].mean(), inplace=True)
df1[18].fillna(df1[18].mean(), inplace=True)
df1[19].fillna(df1[19].mean(), inplace=True)
df1[20].fillna(df1[20].mean(), inplace=True)
df1[21].fillna(df1[21].mean(), inplace=True)
df1[22].fillna(df1[22].mean(), inplace=True)
df1[23].fillna(df1[23].mean(), inplace=True)
stats.kruskal(df1[0], df1[1], df1[2], df1[3], df1[4], df1[5], df1[6], df1[7], df1[8], df1[9], df1[10], df1[11], df1[12],
df1[13], df1[14], df1[15], df1[16], df1[17], df1[18], df1[19], df1[20], df1[21], df1[22], df1[23])
KruskalResult(statistic=70213661.56432353, pvalue=0.0)
This uneven distribution of ride lengths across the day is statistically significant for both groups - They don't ride equally long at all times of day. Rents in the early morning tend to be longer. Especially for casual riders, rides between 5 and 10am are shorter compared to other times of day. Next, an analysis of rides depending on weekdays is performed.
fig = plt.figure()
fig.set_size_inches(10,5)
ax0 = fig.add_subplot(121)
df1=df[['weekday','ride_length_min']][df['member_casual']=='casual'].groupby(['weekday']).count().reset_index()
ax0.bar(df1['weekday'], df1['ride_length_min'], color=['grey', 'grey', 'grey', 'grey', 'grey', 'indianred', 'indianred'])
ax0.set_xticks([0, 1, 2, 3, 4, 5, 6])
ax0.xaxis.set(ticklabels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.title('Amount of casual rides per weekday')
ax1= fig.add_subplot(122)
df1=df[['weekday','ride_length_min']][df['member_casual']=='member'].groupby(['weekday']).count().reset_index()
ax1.bar(df1['weekday'], df1['ride_length_min'], color=['grey', 'indianred', 'indianred', 'indianred', 'grey', 'grey', 'grey'])
ax1.set_xticks([0, 1, 2, 3, 4, 5, 6])
ax1.xaxis.set(ticklabels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.title('Amount of member rides per weekday')
fig.tight_layout()
plt.show()
#Mann-Whitney-U-test
statistic, p = mannwhitneyu(df['weekday'][df['member_casual'] == 'member'], df['weekday'][df['member_casual'] == 'casual'])
print("Test statistic:", round(statistic,1),", p value:", round(p, 2))
Test statistic: 3593364632746.0 , p value: 0.0
This indicates that casual riders mainly rent bikes on weekends (propably mainly for recreational purposes) while members mainly rent bikes during the week (propably mainly for commuting). The Mann-Whitney-U-test confirms that both groups rent differently across the week. The plots above depict the amount of rents. In the visualization below, the average length of rents per weekday will be determined.
#Average ride length for members and casual riders by day of week
fig=plt.figure()
fig.set_size_inches(10,5)
ax0=fig.add_subplot(121)
df[['member_casual', 'weekday', 'short_rides', 'medium_rides', 'long_rides', 'ride_length_min']][df['member_casual']=='casual'].groupby(['weekday']).mean().plot(
kind='line', color=['slategrey', 'skyblue', 'cadetblue', 'indianred'], ax=ax0)
plt.title('Ride length for casual riders by day of week')
ax0.get_legend().remove()
plt.ylim([0,70])
ax0.set_xticks([0, 1, 2, 3, 4, 5, 6])
ax0.xaxis.set(ticklabels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
ax1=fig.add_subplot(122)
df[['member_casual', 'weekday', 'short_rides', 'medium_rides', 'long_rides', 'ride_length_min']][df['member_casual']=='member'].groupby(['weekday']).mean().plot(
kind='line', color=['slategrey', 'skyblue', 'cadetblue', 'indianred'], ax=ax1)
plt.legend(labels=['short','medium','long','total'], loc='upper right')
plt.title('Ride length for member riders by day of week')
plt.ylim([0,70])
ax1.set_xticks([0, 1, 2, 3, 4, 5, 6])
ax1.xaxis.set(ticklabels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
fig.tight_layout()
plt.show()
Even though members rent more frequently during weekdays, the average ride length is longer on weekends, both for casual and member riders. Following up, two Kruskal-Wallis-tests (nonparametric version of ANOVA) will be performed to test whether the ride length is significantly longer on weekends, both for member and casual riders.
#Kruskal-Wallis-test casual riders
df1=df[['weekday', 'ride_length_min']][df['member_casual']=='casual'].pivot(columns='weekday', values='ride_length_min')
df1[0].fillna(df1[0].mean(), inplace=True)
df1[1].fillna(df1[1].mean(), inplace=True)
df1[2].fillna(df1[2].mean(), inplace=True)
df1[3].fillna(df1[3].mean(), inplace=True)
df1[4].fillna(df1[4].mean(), inplace=True)
df1[5].fillna(df1[5].mean(), inplace=True)
df1[6].fillna(df1[6].mean(), inplace=True)
stats.kruskal(df1[0], df1[1], df1[2], df1[3], df1[4], df1[5], df1[6])
KruskalResult(statistic=9520147.213013094, pvalue=0.0)
#Kruskal-Wallis-test member riders
df1=df[['weekday', 'ride_length_min']][df['member_casual']=='member'].pivot(columns='weekday', values='ride_length_min')
df1[0].fillna(df1[0].mean(), inplace=True)
df1[1].fillna(df1[1].mean(), inplace=True)
df1[2].fillna(df1[2].mean(), inplace=True)
df1[3].fillna(df1[3].mean(), inplace=True)
df1[4].fillna(df1[4].mean(), inplace=True)
df1[5].fillna(df1[5].mean(), inplace=True)
df1[6].fillna(df1[6].mean(), inplace=True)
stats.kruskal(df1[0], df1[1], df1[2], df1[3], df1[4], df1[5], df1[6])
KruskalResult(statistic=13840434.132392252, pvalue=0.0)
For both groups, average ride length significantly differs across weekdays. As shown in the plots above, average rides on weekends are longer for both groups.
Next, different renting behavior of members and casual riders during the different seasons is examined.
fig = plt.figure()
fig.set_size_inches(10,5)
ax0 = fig.add_subplot(121)
df1=df[['month','ride_length_min']][df['member_casual']=='casual'].groupby(['month']).count().reset_index()
ax0.bar(df1['month'], df1['ride_length_min'], color=['grey', 'grey', 'grey', 'grey', 'grey', 'indianred', 'indianred', 'indianred', 'indianred', 'grey', 'grey', 'grey'])
ax0.set_xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
ax0.xaxis.set(ticklabels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.title('Amount of casual rides per month')
ax1= fig.add_subplot(122)
df1=df[['month','ride_length_min']][df['member_casual']=='member'].groupby(['month']).count().reset_index()
ax1.bar(df1['month'], df1['ride_length_min'], color=['darksalmon', 'darksalmon', 'darksalmon', 'darksalmon', 'indianred', 'indianred', 'indianred', 'indianred', 'indianred', 'indianred', 'darksalmon', 'darksalmon'])
ax1.set_xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
ax1.xaxis.set(ticklabels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.title('Amount of member rides per month')
fig.tight_layout()
plt.show()
The curve of member rent frequency is wider than that of casual riders. Casual riders mainly rent during summer months (Jun - Sep), while members also rent more frequently during spring and autumn. Since summer rental occupancy is not a problem, the following unpaired t-test text is calculated to check whether one of the groups rents significantly more often in winter. A t-test is used as the plots show approximately normaly distributed variables.
#unpaired t-test
summary, results = rp.ttest(group1= df['month'][df['member_casual'] == 'member'], group1_name= "Member",
group2= df['month'][df['member_casual'] == 'casual'], group2_name= "Casual")
display(results.iloc[0:4])
| Independent t-test | results | |
|---|---|---|
| 0 | Difference (Member - Casual) = | -1.413000e-01 |
| 1 | Degrees of freedom = | 5.901460e+06 |
| 2 | t = | -6.611740e+01 |
| 3 | Two side test p value = | 0.000000e+00 |
The distribution of rents over the year significantly differs for member and casual riders. The following more global Chi Square test and plot facilitate an easier interpretation:
#Contingency table
seasons=df[['ride_length_min','month','member_casual']].groupby(['member_casual','month']).count().rename(columns={'ride_length_min':'frequency'}).reset_index().set_index('member_casual')
seasons=seasons.pivot(columns='month',values='frequency').reset_index()
winter1=seasons.set_index('member_casual')
winter2=winter1.transpose()
winter2['member_ratio']=winter2['member']/(winter2['member']+winter2['casual'])
seasons['winter']=seasons[[1, 2, 3, 10, 11, 12]].sum(axis=1)
seasons['summer']=seasons[[4, 5, 6,7,8,9]].sum(axis=1)
seasons.drop(columns=[1,2,3,4,5,6,7,8,9,10,11,12], inplace=True)
seasons.set_index('member_casual', inplace=True)
display(seasons)
#Computing test statistic
stat, p, dof, expected = chi2_contingency(seasons)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
print('Dependent (reject H0) - the variables have a significant relation')
else:
print('Independent (H0 holds true)')
#Figure
fig=plt.figure()
fig.set_size_inches(10,5)
ax0=fig.add_subplot(121)
seasons.plot(kind='bar', color=['indianred', 'grey'], ax=ax0)
plt.title('Amount of rents per season')
ax1=fig.add_subplot(122)
winter2=winter1.transpose()
winter2['member_ratio']=winter2['member']/(winter2['member']+winter2['casual'])
winter2['member_ratio'].plot(kind='bar', color=['indianred', 'indianred', 'indianred', 'darksalmon', 'grey', 'grey', 'grey', 'grey', 'grey', 'darksalmon', 'indianred', 'indianred'], ax=ax1)
ax1.set_xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
ax1.xaxis.set(ticklabels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.title('Ratio of members per month')
fig.tight_layout()
plt.show()
| month | winter | summer |
|---|---|---|
| member_casual | ||
| casual | 563726 | 1958499 |
| member | 1178438 | 2200799 |
p value is 0.0 Dependent (reject H0) - the variables have a significant relation
These figures show it quite impressively: Especially in winter months, members rent (significantly) more often than casual riders. E.g., in January, and February, more than 80% of rents are made by members. The plots above again surveyed the amount of rents. Below, the average length of rents over the year is examined.
#Average ride length for members and casual riders by month
fig=plt.figure()
fig.set_size_inches(10,5)
ax0=fig.add_subplot(121)
df[['member_casual', 'month', 'short_rides', 'medium_rides', 'long_rides', 'ride_length_min']][df['member_casual']=='casual'].groupby(['month']).mean().plot(
kind='line', color=['slategrey', 'skyblue', 'cadetblue', 'indianred'], ax=ax0)
plt.title('Ride length for casual riders by month')
ax0.get_legend().remove()
plt.ylim([0,80])
ax0.set_xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
ax0.xaxis.set(ticklabels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
ax1=fig.add_subplot(122)
df[['member_casual', 'month', 'short_rides', 'medium_rides', 'long_rides', 'ride_length_min']][df['member_casual']=='member'].groupby(['month']).mean().plot(
kind='line', color=['slategrey', 'skyblue', 'cadetblue', 'indianred'], ax=ax1)
plt.legend(labels=['short','medium','long','total'], loc='upper right')
plt.title('Ride length for member riders by month')
plt.ylim([0,80])
ax1.set_xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
ax1.xaxis.set(ticklabels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
fig.tight_layout()
plt.show()
This curve is quite surprising - as expected, mean rent time is slightly longer during summer months for both, members and casual riders. However, both long rent curves peak in January. Perhaps renters are too impatient when it's cold and don't wait for the lease to end, allowing mistakes to happen and rents to continue when they should be terminated?
#Kruskal-Wallis-test casual riders
df1=df[['month', 'ride_length_min']][df['member_casual']=='casual'].pivot(columns='month', values='ride_length_min')
df1[1].fillna(df1[1].mean(), inplace=True)
df1[2].fillna(df1[2].mean(), inplace=True)
df1[3].fillna(df1[3].mean(), inplace=True)
df1[4].fillna(df1[4].mean(), inplace=True)
df1[5].fillna(df1[5].mean(), inplace=True)
df1[6].fillna(df1[6].mean(), inplace=True)
df1[7].fillna(df1[7].mean(), inplace=True)
df1[8].fillna(df1[8].mean(), inplace=True)
df1[9].fillna(df1[9].mean(), inplace=True)
df1[10].fillna(df1[10].mean(), inplace=True)
df1[11].fillna(df1[11].mean(), inplace=True)
df1[12].fillna(df1[12].mean(), inplace=True)
stats.kruskal(df1[1], df1[2], df1[3], df1[4], df1[5], df1[6], df1[7], df1[8], df1[9], df1[10], df1[11], df1[12])
KruskalResult(statistic=22081591.77716542, pvalue=0.0)
#Kruskal-Wallis-test member riders
df1=df[['month', 'ride_length_min']][df['member_casual']=='member'].pivot(columns='month', values='ride_length_min')
df1[1].fillna(df1[1].mean(), inplace=True)
df1[2].fillna(df1[2].mean(), inplace=True)
df1[3].fillna(df1[3].mean(), inplace=True)
df1[4].fillna(df1[4].mean(), inplace=True)
df1[5].fillna(df1[5].mean(), inplace=True)
df1[6].fillna(df1[6].mean(), inplace=True)
df1[7].fillna(df1[7].mean(), inplace=True)
df1[8].fillna(df1[8].mean(), inplace=True)
df1[9].fillna(df1[9].mean(), inplace=True)
df1[10].fillna(df1[10].mean(), inplace=True)
df1[11].fillna(df1[11].mean(), inplace=True)
df1[12].fillna(df1[12].mean(), inplace=True)
stats.kruskal(df1[1], df1[2], df1[3], df1[4], df1[5], df1[6], df1[7], df1[8], df1[9], df1[10], df1[11], df1[12])
KruskalResult(statistic=29108692.969542053, pvalue=0.0)
Once again, the uneven distribution of mean ride length across the year is significant for both groups, member and casual riders.
#generating dfs for plot
df1=df[['rideable_type', 'ride_length_min']].groupby(['rideable_type']).count().reset_index()
df2=df[['member_casual', 'rideable_type', 'ride_length_min']].groupby(['member_casual', 'rideable_type']).count().reset_index()
df2.rename(index={4:5},inplace=True)
df2.loc[4] = ['member', 'docked_bike', 0]
df2 = df2.sort_index()
df2=df2.groupby(by=['member_casual', 'rideable_type']).sum().reset_index().set_index('member_casual')
df2=df2.pivot(columns='rideable_type', values='ride_length_min')
#plotting
fig=plt.figure()
fig.set_size_inches(10,5)
ax0=fig.add_subplot(121)
ax0.bar(df1['rideable_type'], df1['ride_length_min'], color=['indianred', 'darksalmon', 'grey'])
ax0.set_xticks([0, 1, 2])
ax0.xaxis.set(ticklabels=['Classic', 'Docked', 'Electric'])
plt.title('Number of rents per category')
ax1=fig.add_subplot(122)
df2.plot(kind='bar', ax=ax1, color=['indianred', 'darksalmon', 'grey'])
plt.legend(loc="upper left", labels=['Classic','Docked','Electric'])
ax1.set_xticks([0, 1])
ax1.xaxis.set(ticklabels=['Casual', 'Member'])
plt.title('Rents per category and membership status')
plt.tight_layout()
plt.show()
Consequently, docked bikes are the least rented category while classic bikes are rented most often. This particularly seems to be the case for members. However, not a single rent event where a member rented a docked bike appeared in the data set. To check whether members and casual riders statistically differ in the rented bike categories, a Chi Square test is performed next.
#Chi Square test
display(df2)
#computing test statistic
stat, p, dof, expected = chi2_contingency(df2)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
print('Dependent (reject H0) - the variables have a significant relation')
else:
print('Independent (H0 holds true)')
| rideable_type | classic_bike | docked_bike | electric_bike |
|---|---|---|---|
| member_casual | |||
| casual | 1132892 | 226728 | 1162605 |
| member | 1922749 | 0 | 1456488 |
p value is 0.0 Dependent (reject H0) - the variables have a significant relation
Once again, this Chi Square test is significant, meaning that the differences between bike category by member and casual riders outlined above are statistically proven.
#bike type per season
df2=df[['month', 'rideable_type', 'ride_length_min']].groupby(['month', 'rideable_type']).count().reset_index()
df2=df2.groupby(by=['month', 'rideable_type']).sum().reset_index().set_index('month')
df2=df2.pivot(columns='rideable_type', values='ride_length_min')
df2.plot(kind='line', color=['indianred', 'darksalmon', 'grey'])
plt.legend(loc="upper left", labels=['Classic','Docked','Electric'])
plt.xticks(ticks=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.title('Rents per category and month')
Text(0.5, 1.0, 'Rents per category and month')
display(df2)
#computing test statistic
stat, p, dof, expected = chi2_contingency(df2)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
print('Dependent (reject H0) - the variables have a significant relation')
else:
print('Independent (H0 holds true)')
| rideable_type | classic_bike | docked_bike | electric_bike |
|---|---|---|---|
| month | |||
| 1 | 55067 | 961 | 47741 |
| 2 | 59414 | 1361 | 54834 |
| 3 | 134439 | 8358 | 141245 |
| 4 | 166712 | 12116 | 192421 |
| 5 | 324046 | 26409 | 284403 |
| 6 | 406660 | 30640 | 331904 |
| 7 | 373173 | 31055 | 419260 |
| 8 | 503033 | 45065 | 256254 |
| 9 | 462284 | 35337 | 258526 |
| 10 | 316139 | 22884 | 292203 |
| 11 | 154039 | 7614 | 198325 |
| 12 | 100635 | 4928 | 141977 |
p value is 0.0 Dependent (reject H0) - the variables have a significant relation
From October to July, classic and electric bikes were comparably popular. However, in August and September, popularity of electric bikes decreased while classic bikes reached their annual maxmimum renting frequency. Docked bikes were rented more often during summer months.
Lastly, a geospatial comparison of member and casual riders will be performed.
# create a map of centered around Chicago
rent_map = folium.Map(location=[41.8, -87], zoom_start=9, tiles='CartoDB positron')
#casual riders
#creating df
df_map_c=df[['round_start_lat', 'round_start_lng','ride_length_min']][df['member_casual']=='casual'].groupby(by=['round_start_lat', 'round_start_lng']).count()
df_map_c.rename(columns={'ride_length_min':'frequency'}, inplace=True)
df_map_c.reset_index(inplace=True)
# initiate a feature group for the rents in the dataframe
rents_c = folium.map.FeatureGroup()
# loop through the rents and add each to the rents feature group
for lat, lng, size in zip(df_map_c.round_start_lat, df_map_c.round_start_lng, df_map_c.frequency):
rents_c.add_child(
folium.features.CircleMarker(
[lat, lng],
radius=size/4000,
color='indianred',
opacity=0.3,
fill=True,
fill_color='indianred',
fill_opacity=0.3
)
)
# add rents to map
rent_map.add_child(rents_c)
#members
#creating df
df_map_m=df[['round_start_lat', 'round_start_lng','ride_length_min']][df['member_casual']=='member'].groupby(by=['round_start_lat', 'round_start_lng']).count()
df_map_m.rename(columns={'ride_length_min':'frequency'}, inplace=True)
df_map_m.reset_index(inplace=True)
# initiate a feature group for the rents in the dataframe
rents_m = folium.map.FeatureGroup()
# loop through the rents and add each to the rents feature group
for lat, lng, size in zip(df_map_m.round_start_lat, df_map_m.round_start_lng, df_map_m.frequency):
rents_m.add_child(
folium.features.CircleMarker(
[lat, lng],
radius=size/4000,
color='grey',
opacity=0.3,
fill=True,
fill_color='grey',
fill_opacity=0.3
)
)
# add rents to map
rent_map.add_child(rents_m)
This map shows that both member and casual riders rent all over Chicago. However, member riders tend to rent a bit more in the Northside and Woodlawn suburb, while casual riders rent more bikes in the harbour area. Maybe, casual riders are mostly tourists using the bikes for exploring and sight seeing.